Homework 6
Below is a table listing the probabilities of three binary random variables. In the empty table cells, fill in the correct values for each marginal or conditional probability. Round your answers to 3 decimal places.
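Since the table itself is not reproduced here, the following is a minimal sketch of how a marginal and a conditional probability are read off a joint table over three binary variables. The joint values used below are hypothetical, not the ones from the homework table.

```python
# Hypothetical joint distribution P(A, B, C) over three binary variables,
# stored as a dict keyed by (a, b, c). These numbers are placeholders.
joint = {
    (0, 0, 0): 0.10, (0, 0, 1): 0.15, (0, 1, 0): 0.05, (0, 1, 1): 0.20,
    (1, 0, 0): 0.10, (1, 0, 1): 0.10, (1, 1, 0): 0.15, (1, 1, 1): 0.15,
}

# Marginal P(A = 1): sum the joint over all values of B and C.
p_a1 = sum(p for (a, b, c), p in joint.items() if a == 1)

# Conditional P(B = 1 | A = 1) = P(A = 1, B = 1) / P(A = 1).
p_a1_b1 = sum(p for (a, b, c), p in joint.items() if a == 1 and b == 1)
p_b1_given_a1 = p_a1_b1 / p_a1

print(round(p_a1, 3), round(p_b1_given_a1, 3))
```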
Consider the gridworld where Left and Right actions are successful 100% of the time. Specifically, the available actions in each state are to move to the neighboring grid squares. From state a, there is also an exit action available, which results in going to the terminal state and collecting a reward of 10. Similarly, in state e, the reward for the exit action is 4. Exit actions are successful 100% of the time.
Let the discount factor γ = 0.5. Fill in the following quantities.
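A minimal value-iteration sketch for this kind of problem is given below. The figure is not reproduced here, so the layout is an assumption: a single row of states a, b, c, d, e with deterministic Left/Right moves that give reward 0, an Exit from a worth 10, and an Exit from e worth 4 (the two exit rewards are from the problem statement; the linear layout is assumed).

```python
# Value iteration on an assumed 1x5 gridworld a-b-c-d-e with gamma = 0.5.
gamma = 0.5
states = ["a", "b", "c", "d", "e"]
V = {s: 0.0 for s in states}

def successors(s):
    """Yield (reward, next_state_or_None) for each available action; None means terminal."""
    i = states.index(s)
    if i > 0:
        yield (0.0, states[i - 1])   # Left
    if i < len(states) - 1:
        yield (0.0, states[i + 1])   # Right
    if s == "a":
        yield (10.0, None)           # Exit from a
    if s == "e":
        yield (4.0, None)            # Exit from e

for _ in range(100):                 # enough synchronous backups to converge
    V = {s: max(r + (gamma * V[ns] if ns is not None else 0.0)
                for r, ns in successors(s))
         for s in states}

print({s: round(v, 3) for s, v in V.items()})
```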
We will consider a simple MDP that has six states, A, B, C, D, E, and F. Each state has a single action, go. An arrow from a state x to a state y indicates that it is possible to transition from x to y when go is taken. If there are multiple arrows leaving a state x, transitioning to each of those next states is equally likely. The state F has no outgoing arrows: once you arrive in F, you stay in F for all future time steps. The reward is 1 for all transitions, with one exception: staying in F gets a reward of 0. Assume a discount factor γ = 0.7, and assume that we initialize the value of each state to 0. (Note: you should not need to explicitly run value iteration to solve this problem.)
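(The transition diagram is not reproduced here, so the following is only the general shape of the reasoning, not the answer. With a single action and reward 1 on every transition except the F self-loop, the Bellman equation reduces to V(s) = 1 + 0.7 · (average of V over the successors of s), with V(F) = 0. In particular, a state whose every trajectory is absorbed into F after exactly k rewarded transitions has value 1 + 0.7 + 0.7² + … + 0.7^(k−1) = (1 − 0.7^k)/0.3, while a state that never reaches F would have value 1/(1 − 0.7) ≈ 3.333.)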
Consider the following transition diagram, transition function and reward function for an MDP.
Suppose we are doing policy evaluation by following the policy given in the left-hand table below. Our current estimates (at the end of some iteration of policy evaluation) of the values of the states under this policy are given in the right-hand table.
Round your answers to 3 decimal places.
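A minimal sketch of one round of policy evaluation is shown below. The actual policy, transition, reward, and value tables from the problem are not reproduced here, so the small two-state MDP used in the sketch is purely hypothetical, as is the discount of 0.5.

```python
# One synchronous policy-evaluation backup:
#   V_new(s) = sum_{s'} T(s, pi(s), s') * ( R(s, pi(s), s') + gamma * V(s') )
gamma = 0.5                                        # hypothetical discount

pi = {"A": "go", "B": "go"}                        # fixed policy to evaluate
T = {("A", "go"): {"A": 0.5, "B": 0.5},            # T(s, a) -> {s': prob}
     ("B", "go"): {"A": 1.0}}
R = {("A", "go", "A"): 1.0, ("A", "go", "B"): 2.0, # R(s, a, s')
     ("B", "go", "A"): 0.0}
V = {"A": 0.0, "B": 0.0}                           # current estimates of V^pi

V_new = {s: sum(p * (R[(s, pi[s], s2)] + gamma * V[s2])
                for s2, p in T[(s, pi[s])].items())
         for s in V}
print({s: round(v, 3) for s, v in V_new.items()})
```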
Consider an MDP with 3 states, A, B, and C, and 2 actions, Clockwise and Counterclockwise. We do not know the transition function or the reward function for the MDP; instead, we are given samples of what an agent actually experiences when it interacts with the environment (although we do know that the agent never remains in the same state after taking an action).
In this problem, instead of first estimating the transition and reward functions, we will directly estimate the Q function using Q-learning. Assume the discount factor γ is 0.5 and the Q-learning step size α is 0.5. Our current Q function, Q(s,a), is as follows.
Round your answers to 3 decimal places.
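A minimal sketch of a single Q-learning update with the α and γ given above is shown below. The Q table entries and the sample transition in the example are hypothetical, not the ones from the homework.

```python
# Q-learning update: Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r + gamma * max_a' Q(s',a'))
alpha, gamma = 0.5, 0.5
ACTIONS = ("Clockwise", "Counterclockwise")

Q = {(s, a): 0.0 for s in ("A", "B", "C") for a in ACTIONS}  # placeholder Q values

def q_update(Q, s, a, r, s2):
    """Apply one sample-based Q-learning update for the transition (s, a, r, s')."""
    target = r + gamma * max(Q[(s2, a2)] for a2 in ACTIONS)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target

# Example: process one hypothetical sample (A, Clockwise, reward 2, B).
q_update(Q, "A", "Clockwise", 2.0, "B")
print(round(Q[("A", "Clockwise")], 3))   # 0.5 * 0 + 0.5 * (2 + 0.5 * 0) = 1.0
```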
In general, for Q-learning to converge to the optimal Q-values:
Mark the correct statements about TD-learning and Q-learning.
For the following action-selection method, indicate which option describes it best.
With probability p = 0.99, select argmax_a Q(s,a). With probability 1 − p, select a random action.
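A minimal sketch of this rule, assuming a dictionary-style Q table and a finite action list (both hypothetical), i.e. epsilon-greedy selection with a fixed exploration probability of 1 − p = 0.01:

```python
import random

p = 0.99

def select_action(Q, s, actions):
    """With probability p pick argmax_a Q(s, a); otherwise pick an action uniformly at random."""
    if random.random() < p:
        return max(actions, key=lambda a: Q[(s, a)])
    return random.choice(actions)

Q = {("s0", "left"): 1.0, ("s0", "right"): 2.0}     # hypothetical Q table
print(select_action(Q, "s0", ["left", "right"]))    # almost always "right"
```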
For the following action-selection method, indicate which option describes it best.
Select action a with probability
P(a | s) = exp(Q(s,a)/τ) / Σ_a′ exp(Q(s,a′)/τ),
where τ is a temperature parameter that is decreased over time.
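A minimal sketch of this softmax (Boltzmann) selection rule with a decaying temperature is shown below. The Q values, action set, and decay schedule are hypothetical.

```python
import math
import random

def softmax_select(Q, s, actions, tau):
    """Select a with probability exp(Q(s,a)/tau) / sum_{a'} exp(Q(s,a')/tau)."""
    weights = [math.exp(Q[(s, a)] / tau) for a in actions]
    return random.choices(actions, weights=weights)[0]

# Example: as tau shrinks over time, selection becomes nearly greedy.
Q = {("s0", "left"): 1.0, ("s0", "right"): 2.0}
for tau in (10.0, 1.0, 0.1):                       # hypothetical decay schedule
    print(tau, softmax_select(Q, "s0", ["left", "right"], tau))
```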
Although we have many iterative methods for solving MDPs, we can also solve MDPs using expectimax search.
When learning with epsilon-greedy action selection, it is a good idea to decrease epsilon to 0 with time.
Feature-based (approximate) Q-learning will always find the same optimal Q* that would be returned by tabular Q-learning.
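For reference, a minimal sketch of the approximate (feature-based) Q-learning update, where Q(s,a) is represented as a weighted sum of features, Q(s,a) = Σ_i w_i · f_i(s,a). The feature values, step size, and sample used below are hypothetical; the point is only that Q is constrained to this linear form rather than stored in a table.

```python
alpha, gamma = 0.5, 0.5                 # hypothetical step size and discount

def q_value(w, feats):
    """Linear Q estimate: sum_i w_i * f_i(s, a)."""
    return sum(wi * fi for wi, fi in zip(w, feats))

def approx_q_update(w, feats_sa, r, next_qs):
    """w_i <- w_i + alpha * difference * f_i(s, a), where
    difference = (r + gamma * max_a' Q(s', a')) - Q(s, a)."""
    difference = (r + gamma * max(next_qs)) - q_value(w, feats_sa)
    return [wi + alpha * difference * fi for wi, fi in zip(w, feats_sa)]

w = [0.0, 0.0]                          # one weight per feature
feats = [1.0, 0.5]                      # hypothetical f(s, a) for the observed (s, a)
w = approx_q_update(w, feats, r=2.0, next_qs=[0.0, 0.0])
print([round(x, 3) for x in w])         # difference = 2.0, so w = [1.0, 0.5]
```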